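Multi-Model Routing and Cost Optimization

In 2026, GPT-5, Claude Opus 4.5, Gemini 3 Pro, and open-source models all coexist, and per-token prices between the most and least expensive options differ by about two orders of magnitude. A bad routing strategy can burn $100,000 overnight. This article covers cost engineering for production AI applications.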

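Before You Start

Back in 2023, an AI engineer only had to choose between GPT-4 and GPT-3.5. Today the options have exploded:

Top-tier closed (expensive, strong): GPT-5, Claude Opus 4.5, Gemini 3 Pro
Mid-tier closed (best value): GPT-5-mini, Claude Sonnet 4.5, Gemini 3 Flash
Budget closed (fast, cheap): GPT-5-nano, Claude Haiku, Gemini 3 Flash Lite
Reasoning-specialized: o3, Claude Sonnet Thinking, Gemini 3 Deep Think
Open source (self-hosted): Llama 4, Qwen 3, DeepSeek V3, GLM-4.6
Special-purpose models: Codex, Cursor's small models, SoTA embedding models...

The cost gap between models can reach 100x or more (GPT-5 vs. Haiku vs. self-hosted open source). The quality gap is nowhere near 100x: on many tasks Haiku performs almost as well as Opus.

No routing = running every task on Opus = burning money. Wrong routing = simple tasks on Opus and complex tasks on Haiku = burning money and getting poor quality.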

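Learning Objectives

Distinguish the four types of routing strategy (static, rule-based, ML, LLM-as-Router)
Design a cost-aware fallback chain (automatic degradation when the expensive model is down)
Use an LLM gateway such as Helicone, Portkey, or LiteLLM
Apply Prompt Caching, the Batch API, and an embeddings cache
Evaluate token compression techniques
Identify where reasoning routing applies (reasoning models vs. fast models)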

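How This Connects to Earlier Material

03 Prompt Metaprogramming: adapting prompts across models (prerequisite)
5-9 LLMOps: cost attribution (prerequisite)
6-4 Cost Modeling: the business-level view (prerequisite)
18 AI for AI: using AI to optimize routing strategy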

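Chapter 1: The Real Structure of Cost

1.1 A Real Cost Blowup

A SaaS product launched an AI feature in September 2025. The first week's bill:

Expected: $500/month (based on load testing)
Actual: $8,000 in week 1

Attribution:
- All users routed to Opus   $5,000
- No prompt caching          $2,000
- Unbounded retry loops      $800
- N+1 tool calls             $200

After the fix:

New strategy:
- The 90% of tasks that are simple go to Haiku
- The 10% of tasks that are complex go to Opus
- Every system prompt is cached
- Retries are capped

Week 2: $520

That is a roughly 15x cost reduction with no significant drop in quality. This is the real value of multi-model routing.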

1.2 What Makes Up the Cost

totalCost = sum(modelCost) + sum(toolCost) + infrastructureCost

modelCost = inputTokens * inputPrice + outputTokens * outputPrice

// Illustrative mid-2026 prices (USD per 1M tokens); check each provider's pricing page for current figures
const prices = {
  'gpt-5':           { input: 15.00, output: 60.00 },
  'claude-opus-4-5': { input: 15.00, output: 75.00 },
  'gemini-3-pro':    { input: 12.50, output: 50.00 },
  'gpt-5-mini':      { input: 0.40, output: 1.60 },
  'claude-sonnet':   { input: 3.00, output: 15.00 },
  'gemini-3-flash':  { input: 0.30, output: 1.20 },
  'claude-haiku':    { input: 0.80, output: 4.00 },
  'gpt-5-nano':      { input: 0.10, output: 0.40 },
  'llama-4-self':    { input: 0.05, output: 0.20 },  // self-hosted, amortized
};

// The same conversation costs 150x more on gpt-5 than on gpt-5-nano, and 300x more than on the amortized self-hosted figure
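To make the formula concrete, here is a minimal sketch that prices a single request. The three prices are copied from the table above; the 2,000-input / 500-output turn is a hypothetical workload chosen only for illustration.

```typescript
// Prices in USD per 1M tokens, copied from the table above (illustrative figures).
type Price = { input: number; output: number };

const PRICES: Record<string, Price> = {
  'gpt-5': { input: 15.0, output: 60.0 },
  'claude-haiku': { input: 0.8, output: 4.0 },
  'gpt-5-nano': { input: 0.1, output: 0.4 },
};

// modelCost = inputTokens * inputPrice + outputTokens * outputPrice, in USD.
function requestCost(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  if (!price) throw new Error(`Unknown model: ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Hypothetical chat turn: 2,000 input tokens, 500 output tokens.
const bigModelCost = requestCost('gpt-5', 2000, 500);
const smallModelCost = requestCost('gpt-5-nano', 2000, 500);
const costRatio = bigModelCost / smallModelCost;
console.log({ bigModelCost, smallModelCost, costRatio });
```

Because both input and output prices differ by the same factor between these two models, the ratio stays at about 150 regardless of the input/output mix; for other model pairs it depends on how output-heavy the workload is.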

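1.3 Key Insights

1. Output tokens cost 4-5x more than input tokens

→ Trimming output length saves more per token than trimming input length.

2. Cache hits are billed at about 10% of the input price on Anthropic and at a discount on OpenAI (50% at launch; the discount varies by model)

→ Caching is close to a free lunch (see 01 Context Engineering). The caveat: Anthropic charges a premium on cache writes, so a prefix has to be reused before it pays off.

3. Using a strong model for simple tasks is waste

→ A Haiku-class model is fully adequate for most tasks (this is the core of routing).

4. Self-hosting does not necessarily save money

→ Running Llama 4 yourself means paying for GPUs; at low volume it costs more than the API.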
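Insight 4 can be turned into a break-even calculation. The sketch below assumes a simple cost model (a fixed monthly GPU and operations bill plus a small marginal cost per token); every number in the example call is hypothetical and should be replaced with your own figures.

```typescript
// Monthly token volume above which self-hosting is cheaper than the API.
//   fixedMonthlyUsd   fixed GPU rental + operations cost per month (assumed input)
//   apiPricePerM      blended API price in USD per 1M tokens
//   selfMarginalPerM  marginal self-hosted cost in USD per 1M tokens
function breakEvenTokensPerMonth(
  fixedMonthlyUsd: number,
  apiPricePerM: number,
  selfMarginalPerM: number,
): number {
  const savingPerM = apiPricePerM - selfMarginalPerM;
  if (savingPerM <= 0) return Infinity; // the API is never more expensive per token
  return (fixedMonthlyUsd / savingPerM) * 1_000_000;
}

// Hypothetical: $3,000/month of GPUs, $1.00 per 1M tokens via API, $0.05 marginal.
const breakEven = breakEvenTokensPerMonth(3000, 1.0, 0.05);
console.log(`Break-even at about ${(breakEven / 1e9).toFixed(2)}B tokens per month`);
```

Below that volume the fixed GPU bill dominates and the API is cheaper, which is the point the insight makes about low-volume workloads.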

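Chapter 2: Four Types of Routing Strategy

2.1 Strategy A: Static Routing

The simplest approach: hard-code which model each task type uses.

// Typed as Record<string, string> so that lookups by an arbitrary task type compile
const modelMap: Record<string, string> = {
  classify: 'claude-haiku',
  summarize: 'claude-sonnet',
  code_review: 'claude-opus',
  chat: 'claude-sonnet',
};

function getModel(taskType: string) {
  return modelMap[taskType] ?? 'claude-sonnet';  // unknown task types fall back to the mid tier
}

Good fit:

Few task types (<10)
Clear boundaries between them
Early-stage products

Poor fit:

Task difficulty varies widely (the same "chat" type covers both easy and hard requests)
Input is controlled by the user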
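2.2 Strategy B: Rule-Based Routing

Decide with rules over input features. Rules are evaluated top to bottom and the first match wins, so the user-tier rules only apply to queries that no earlier rule caught:

// containsCode, containsMath, isQuestion and getQuestionDifficulty are app-specific helpers (not shown)
function route(query: string, context: Context): string {
  // Length rules
  if (query.length < 50) return 'haiku';
  if (query.length > 5000) return 'opus';  // long input needs strong comprehension
  
  // Content-type rules
  if (containsCode(query)) return 'sonnet';
  if (containsMath(query)) return 'o3';  // reasoning model
  if (isQuestion(query)) return getQuestionDifficulty(query);
  
  // User-tier rules
  if (context.userTier === 'free') return 'haiku';
  if (context.userTier === 'pro') return 'sonnet';
  if (context.userTier === 'enterprise') return 'opus';
  
  return 'sonnet';  // default
}

Good fit:

Input features are obvious (length, keywords)
Users are tiered
The rules can be written down by hand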

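2.3 Strategy C: ML Routing

Use a small model or classifier to predict the best model:

class MLRouter {
  private classifier: TextClassifier;
  
  async route(query: string): Promise<string> {
    // A small model predicts "difficulty"
    const features = await this.classifier.predict(query);
    
    if (features.complexity < 0.3) return 'haiku';
    if (features.complexity < 0.7) return 'sonnet';
    return 'opus';
  }
}

Training data: historical records labeled with which model answered each task well.

Good fit:

Plenty of historical data
A complex task space
The team has ML expertise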
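One way to derive those training labels, sketched under assumptions: each historical query has been answered by several models and scored offline, and the label is the cheapest model that cleared a quality bar. The `EvalRecord` shape, the threshold, and the fallback label are all hypothetical, not a prescribed schema.

```typescript
interface EvalRecord {
  query: string;
  model: string;
  qualityScore: number; // 0..1, from human review or an LLM judge (assumed to exist)
  costUsd: number;
}

// For every query, pick the cheapest model whose answer met the quality bar.
// Queries where no model was good enough get `fallbackModel` as their label.
function labelCheapestAdequate(
  records: EvalRecord[],
  minQuality: number,
  fallbackModel: string,
): Map<string, string> {
  const cheapest = new Map<string, EvalRecord>();
  const labels = new Map<string, string>();
  for (const r of records) {
    if (!labels.has(r.query)) labels.set(r.query, fallbackModel);
    if (r.qualityScore < minQuality) continue;
    const current = cheapest.get(r.query);
    if (current === undefined || r.costUsd < current.costUsd) {
      cheapest.set(r.query, r);
      labels.set(r.query, r.model);
    }
  }
  return labels;
}

const history: EvalRecord[] = [
  { query: 'q1', model: 'haiku', qualityScore: 0.9, costUsd: 0.001 },
  { query: 'q1', model: 'opus', qualityScore: 0.95, costUsd: 0.05 },
  { query: 'q2', model: 'haiku', qualityScore: 0.5, costUsd: 0.001 },
  { query: 'q2', model: 'opus', qualityScore: 0.9, costUsd: 0.05 },
  { query: 'q3', model: 'haiku', qualityScore: 0.4, costUsd: 0.001 },
];
const labels = labelCheapestAdequate(history, 0.8, 'opus');
console.log(labels);
```

The resulting (query, label) pairs are what a text classifier would be trained on; the classifier itself is out of scope for this sketch.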

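2.4 Strategy D: LLM-as-Router

Use the cheapest model to make the routing decision:

async function route(query: string): Promise<string> {
  const decision = await haiku.chat({
    messages: [{
      role: 'user',
      content: `Decide which model tier this question needs:
      
Question: ${query}

Output JSON:
{
  "complexity": "simple" | "medium" | "complex",
  "needs_reasoning": boolean,
  "needs_long_context": boolean
}`
    }]
  });
  
  let d;
  try {
    d = JSON.parse(decision.content);
  } catch {
    return 'sonnet';  // unparseable router output: fall back to the default tier
  }
  
  if (d.needs_reasoning) return 'o3';
  if (d.needs_long_context) return 'sonnet-200k';
  if (d.complexity === 'simple') return 'haiku';
  if (d.complexity === 'medium') return 'sonnet';
  return 'opus';
}

Good fit:

The task space cannot be predicted
No usable rules can be written down
One extra LLM call is acceptable (nearly free on a Haiku-class model, though it adds latency)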
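Router output is model-generated text, so it is worth validating before acting on it. The sketch below is one defensive way to parse the JSON described above; the "medium, no special needs" default is an assumption of this example, not something the routing strategy prescribes.

```typescript
type Complexity = 'simple' | 'medium' | 'complex';

interface RouterDecision {
  complexity: Complexity;
  needs_reasoning: boolean;
  needs_long_context: boolean;
}

// Assumed safe default: route to the middle tier when the router output is unusable.
const SAFE_DEFAULT: RouterDecision = {
  complexity: 'medium',
  needs_reasoning: false,
  needs_long_context: false,
};

function isComplexity(x: unknown): x is Complexity {
  return x === 'simple' || x === 'medium' || x === 'complex';
}

function parseRouterDecision(raw: string): RouterDecision {
  // Models often wrap JSON in prose or code fences, so take the outermost {...} span.
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) return SAFE_DEFAULT;
  try {
    const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
    if (typeof parsed !== 'object' || parsed === null) return SAFE_DEFAULT;
    const fields = parsed as Record<string, unknown>;
    if (!isComplexity(fields.complexity)) return SAFE_DEFAULT;
    return {
      complexity: fields.complexity,
      // Anything other than a literal `true` is treated as false.
      needs_reasoning: fields.needs_reasoning === true,
      needs_long_context: fields.needs_long_context === true,
    };
  } catch {
    return SAFE_DEFAULT;
  }
}

const wrapped = parseRouterDecision(
  'Sure! {"complexity":"simple","needs_reasoning":false,"needs_long_context":true} Hope that helps.',
);
const garbage = parseRouterDecision('I cannot decide.');
const invalid = parseRouterDecision('{"complexity":"huge"}');
console.log(wrapped, garbage, invalid);
```

Falling back to a mid-tier default keeps a malformed router reply from either failing the request or silently sending everything to the most expensive model.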

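2.5 Choosing a Strategy

Your scenario →
  Are task types clear and fixed?
    Yes → Static routing (A)
    No →
      Are input features obvious?
        Yes → Rule-based routing (B)
        No →
          Do you have a lot of historical data?
            Yes → ML routing (C)
            No → LLM-as-Router (D)

Production systems usually combine them: rules filter out the obviously simple requests first (B), and an LLM decides the uncertain ones (D).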
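A minimal sketch of that B + D combination follows. The length thresholds come from section 2.2; the code-detection regex and the injected `llmRoute` callback are assumptions made so the example stays self-contained.

```typescript
type Model = 'haiku' | 'sonnet' | 'opus';

// Cheap deterministic rules. Returns a model when a rule is confident,
// or null to defer the decision to the LLM router.
function ruleRoute(query: string): Model | null {
  const trimmed = query.trim();
  const looksLikeCode = /[{};]/.test(trimmed); // crude heuristic, assumed for illustration
  if (trimmed.length < 50 && !looksLikeCode) return 'haiku';
  if (trimmed.length > 5000) return 'opus';
  return null;
}

// Rules first; only the undecided queries pay for an extra LLM call.
async function hybridRoute(
  query: string,
  llmRoute: (q: string) => Promise<Model>, // e.g. an LLM-as-Router call, injected by the caller
): Promise<Model> {
  return ruleRoute(query) ?? (await llmRoute(query));
}

const shortQuery = ruleRoute('What are your opening hours?');
const longQuery = ruleRoute('x'.repeat(6000));
const unsure = ruleRoute('Please compare these two retention strategies for a subscription business.');
console.log({ shortQuery, longQuery, unsure });
```

Logging how often `ruleRoute` returns null gives a direct measure of how much traffic still incurs the router call.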

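Chapter 3: Cost-Aware Fallback

3.1 Why You Need Fallback

A single model is not enough:

The model API goes down (Anthropic has occasionally been down for around 30 minutes)
The model hits a rate limit
The model refuses (safety refusal)
The model times out

No fallback = a broken user experience.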

3.2 The Classic Fallback Chain

// callModel and withTimeout are app-level helpers (not shown)
async function chatWithFallback(messages: Message[]) {
  const chain = [
    { model: 'claude-opus-4-5', timeout: 30_000 },
    { model: 'claude-sonnet-4-5', timeout: 30_000 },
    { model: 'gpt-5', timeout: 30_000 },          // cross-vendor
    { model: 'gemini-3-pro', timeout: 30_000 },
  ];
  
  for (const { model, timeout } of chain) {
    try {
      return await withTimeout(
        callModel(model, messages),
        timeout
      );
    } catch (error) {
      logger.warn(`Model ${model} failed, falling back`, { error });
      continue;
    }
  }
  
  throw new Error('All models failed');
}

Design principles:

Cross-vendor: survive a whole vendor going down (rare, but it has happened)
Degrade quality to stay up: Opus → Sonnet → Haiku is also a valid chain
Strict timeouts: at most 30 seconds per model
No infinite retries: once the chain is exhausted, throw
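The fallback chain relies on a `withTimeout` helper that the text does not define. One possible implementation is sketched below using `Promise.race`; note that it only stops waiting and does not cancel the underlying request (for `fetch`, an `AbortController` would be needed for that).

```typescript
// Rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins. Always clear the timer so it cannot
  // fire later or keep the process alive.
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Usage: a slow call (simulated with a 200 ms delay) against a 20 ms limit.
const slowCall = new Promise<string>((resolve) => setTimeout(() => resolve('late'), 200));
withTimeout(slowCall, 20).catch((err: Error) => console.log(err.message));
```

Since the abandoned request keeps running on the provider side, it may still be billed; that is one reason to keep per-model timeouts strict rather than generous.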

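3.3 Smart Fallback

You do not have to walk the chain from the top every time. Track which models are currently healthy:

class SmartFallback {
  private healthScore = new Map<string, number>();
  
  // Seed every candidate as fully healthy, in preference order.
  // With an empty map the loop below would have nothing to try.
  constructor(models: string[]) {
    for (const model of models) this.healthScore.set(model, 1.0);
  }
  
  async call(messages: Message[]) {
    // Sort models by health; the sort is stable, so ties keep the preference order
    const sorted = [...this.healthScore.entries()]
      .sort((a, b) => b[1] - a[1])
      .map(([model]) => model);
    
    for (const model of sorted) {
      try {
        const result = await callModel(model, messages);
        // Success: health +0.1
        this.updateHealth(model, +0.1);
        return result;
      } catch (error) {
        // Failure: health -0.3
        this.updateHealth(model, -0.3);
        continue;
      }
    }
    
    throw new Error('All failed');
  }
  
  updateHealth(model: string, delta: number) {
    const current = this.healthScore.get(model) ?? 1.0;
    // Clamp to [0, 1]
    this.healthScore.set(model, Math.max(0, Math.min(1, current + delta)));
  }
}

Good fit:

Real production environments
Many concurrent users
Frequent vendor incidents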

3.4 Protecting Quality During Fallback

When you degrade to a cheaper model, answer quality may drop. Two ways to handle it:

Mode 1: Tell the user explicitly

return {
  answer: result.content,
  modelUsed: model,
  qualityNote: model.includes('haiku') ? 'Answered with a fast model; retry for a more detailed answer' : null
};

Mode 2: Validate with LLM-as-Judge

const result = await fallbackToHaiku(messages);
const quality = await judge.evaluate(messages, result);

if (quality.score < 0.7) {
  // Haiku's answer is not good enough, retry with Sonnet
  return await callModel('sonnet', messages);
}
return result;

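Chapter 4: LLM Gateway

Do not reinvent the wheel. Mature LLM gateways as of 2025-2026:

4.1 Mainstream Gateways

Tool	Type	Notes
Helicone	Hosted	Strong observability; solid caching and cost tracking
Portkey	Hosted	Rich routing strategies; enterprise-grade
LiteLLM	Self-hosted / Python proxy	Open source; unified API
OpenRouter	Hosted	One API key for all models
Vercel AI Gateway	Hosted	Integrated with the Vercel ecosystem
In-house	Self-built	Full control; requires engineering investment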

4.2 LiteLLM in Practice

One of the most widely used open-source options:

# Server side: start the LiteLLM proxy
litellm --config config.yaml

# config.yaml
model_list:
  - model_name: gpt-5
    litellm_params:
      model: openai/gpt-5
      api_key: os.environ/OPENAI_KEY
      
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-opus-4-5
      api_key: os.environ/ANTHROPIC_KEY
      
  - model_name: smart
    litellm_params:
      model: openrouter/auto  # OpenRouter's automatic routing

router_settings:
  routing_strategy: simple-shuffle  # or cost-based-routing, latency-based-routing
  fallbacks:
    # each fallback target must also be defined as a model_name in model_list
    - claude-opus: [claude-sonnet, gpt-5, gemini-3-pro]

Client side:

// All requests go to LiteLLM, which proxies them to each vendor
const response = await fetch('http://localhost:4000/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer sk-xxx',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'smart',  // let the gateway decide
    messages: [...]
  })
});

Advantages:

Application code is written once (OpenAI-compatible protocol)
Switching models means changing config, not code
Built-in fallback, caching, and rate limiting
Open source and self-hostable

4.3 Helicone in Practice

A hosted option; changing the baseURL is enough to integrate:

const client = new OpenAI({
  apiKey: process.env.OPENAI_KEY,
  baseURL: 'https://oai.helicone.ai/v1',  // change this
  defaultHeaders: {
    'Helicone-Auth': `Bearer ${process.env.HELICONE_KEY}`,
    'Helicone-Cache-Enabled': 'true',  // automatic caching
    'Helicone-User-Id': userId,  // per-user attribution
  }
});

// Then use the OpenAI API as usual
const response = await client.chat.completions.create({...});

Helicone automatically:

Logs every call
Attributes cost by user / feature
Provides a dashboard
Saves money on cache hits

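4.4 Self-Built vs. Hosted

Dimension	Self-built	Hosted
Data privacy	High	Medium (data passes through a third party)
Time to start	Slow	Fast
Long-term cost	Low	Medium (usage-based)
Maintenance	Required	Not required
Customization	Full	Limited

Recommendations:

Startup / MVP: use Helicone or Portkey
Mid-size / data-sensitive: self-host LiteLLM
Large / strict compliance: build fully in-house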

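Chapter 5: Engineering Prompt Caching

See Chapter 5 of 01 for the basics; this chapter goes deeper.

5.1 What Caching Actually Saves

Conversation scenario:
- No cache: every turn resends a 5K system prompt plus the accumulated history → per-turn cost grows linearly with the number of turns, so total cost grows quadratically
- Cache the system prompt: the system portion hits the cache → 90% discount (Anthropic)
- Cache the history: most of the history is billed at the cached rate, so long conversations add little cost

Measured data (production case):

Before (no cache): $0.18 / user / hour
After (cache system + RAG): $0.04 / user / hour
Cost reduction: about 78%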
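The effect on a multi-turn conversation can be estimated with a small model. The sketch below assumes Anthropic-style billing (cache reads at 0.1x and cache writes at 1.25x the input price), a fixed number of tokens added per turn, and that every turn arrives before the cache expires; all three are simplifying assumptions, and the workload numbers are hypothetical.

```typescript
interface CacheBilling {
  readMultiplier: number;  // price of a cached input token relative to a normal one
  writeMultiplier: number; // price of writing a token into the cache
}

// Total input cost (USD) of a conversation. Each turn resends the system prompt
// plus all history so far; with caching, the previously sent prefix is read from cache.
function conversationInputCost(
  systemTokens: number,
  tokensAddedPerTurn: number,
  turns: number,
  pricePerMTokens: number,
  cache?: CacheBilling,
): number {
  let billable = 0; // measured in full-price token equivalents
  for (let k = 1; k <= turns; k++) {
    const total = systemTokens + (k - 1) * tokensAddedPerTurn;
    if (!cache) {
      billable += total;
      continue;
    }
    const cached = k === 1 ? 0 : systemTokens + (k - 2) * tokensAddedPerTurn;
    const fresh = total - cached;
    billable += cached * cache.readMultiplier + fresh * cache.writeMultiplier;
  }
  return (billable * pricePerMTokens) / 1_000_000;
}

// Hypothetical: 5K system prompt, 500 tokens added per turn, 10 turns, $3 per 1M input tokens.
const uncached = conversationInputCost(5000, 500, 10, 3.0);
const cachedCost = conversationInputCost(5000, 500, 10, 3.0, { readMultiplier: 0.1, writeMultiplier: 1.25 });
console.log({ uncached, cachedCost, saving: 1 - cachedCost / uncached });
```

Under these assumptions the input bill drops by roughly three quarters, which is in the same range as the production figure quoted above; output tokens are not affected by caching and are left out of the model.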

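5.2 Designing a Cache Strategy

Anthropic marks cacheable blocks with cache_control:

// Anthropic takes the system prompt as a top-level `system` field, not as a message role
const request = {
  system: [
    {
      type: 'text',
      text: longSystemPrompt,  // several thousand characters
      cache_control: { type: 'ephemeral' }  // mark as cacheable
    }
  ],
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: ragContext,  // long RAG content
          cache_control: { type: 'ephemeral' }
        },
        {
          type: 'text',
          text: userQuery  // dynamic part, not cached
        }
      ]
    }
  ]
};

OpenAI caches prefixes automatically:

No markers are needed; prefix caching is applied automatically to prompts of 1,024 tokens or more.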

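5.3 Cache-Friendly Application Architecture

Design principles:

[Static: defined once, unchanged for a long time] - must cache
[Semi-static: per user / per session] - should cache
[Dynamic: different on every request] - do not cache

Wrong structure:

system: "Today is 2026-06-12. You are helpful..."  
// ❌ The date comes first, so the cache is invalidated every day

Correct:

system: "You are helpful..."  // static, cached
user: "Today: 2026-06-12. Question: ..."  // dynamic parts go last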
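Provider prefix caches match from the start of the prompt, so the ordering above determines how much can be reused. The sketch below illustrates this with a hypothetical prompt builder and a helper that counts how many leading segments two prompts share; the segment granularity is a simplification, since real caches work on tokens.

```typescript
interface PromptParts {
  staticSystem: string;   // never changes between requests
  sessionContext: string; // stable within a user session
  dynamic: string;        // changes on every request (date, question)
}

// Cache-friendly order: most stable first, most volatile last.
function buildPrompt(parts: PromptParts): string[] {
  return [parts.staticSystem, parts.sessionContext, parts.dynamic];
}

// Number of leading segments two prompts have in common, i.e. what a prefix cache could reuse.
function sharedPrefixLength(a: string[], b: string[]): number {
  let n = 0;
  while (n < a.length && n < b.length && a[n] === b[n]) n++;
  return n;
}

const monday = buildPrompt({
  staticSystem: 'You are helpful...',
  sessionContext: 'User plan: pro',
  dynamic: 'Today: 2026-06-12. Question: A',
});
const tuesday = buildPrompt({
  staticSystem: 'You are helpful...',
  sessionContext: 'User plan: pro',
  dynamic: 'Today: 2026-06-13. Question: B',
});
const goodReuse = sharedPrefixLength(monday, tuesday);

// Anti-pattern: the date leads the prompt, so nothing after it can be reused.
const badMonday = ['Today is 2026-06-12.', 'You are helpful...', 'Question: A'];
const badTuesday = ['Today is 2026-06-13.', 'You are helpful...', 'Question: B'];
const badReuse = sharedPrefixLength(badMonday, badTuesday);
console.log({ goodReuse, badReuse });
```

A check like `sharedPrefixLength` can run in tests to catch a refactor that accidentally moves a volatile field to the front of the prompt.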

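5.4 Cache TTL and Invalidation

Anthropic: 5-minute TTL by default (short). OpenAI: managed automatically.

How to keep a cache alive:

Any cache hit within the 5 minutes refreshes the TTL
Longer-lived caching uses the 1-hour TTL option (introduced as a beta; check the current docs for availability and pricing)

Design trade-off: high-frequency traffic renews the TTL naturally, while low-frequency traffic gets a low hit rate, so weigh whether caching pays off.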
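That trade-off can be put into numbers. The sketch below assumes requests for a given prefix arrive as a Poisson process, so a request is a hit exactly when the previous one arrived within the TTL, and it assumes Anthropic-style multipliers (0.1x read, 1.25x write). Both are modeling assumptions, not measured behavior.

```typescript
// Expected hit rate for Poisson arrivals: P(gap to previous request < TTL) = 1 - e^(-rate * ttl).
function expectedHitRate(requestsPerMinute: number, ttlMinutes: number): number {
  return 1 - Math.exp(-requestsPerMinute * ttlMinutes);
}

// Hit rate above which caching is cheaper than not caching:
//   hit * read + (1 - hit) * write < 1   =>   hit > (write - 1) / (write - read)
function breakEvenHitRate(readMultiplier: number, writeMultiplier: number): number {
  return (writeMultiplier - 1) / (writeMultiplier - readMultiplier);
}

const busy = expectedHitRate(1, 5);      // one request per minute, 5-minute TTL
const quiet = expectedHitRate(0.05, 5);  // one request every 20 minutes
const threshold = breakEvenHitRate(0.1, 1.25);
console.log({ busy, quiet, threshold });
```

With these assumptions the break-even hit rate is about 22%, so a prefix requested once a minute is clearly worth caching while one requested every 20 minutes sits right at the margin.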

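Chapter 6: Batch Processing (Batch API)

6.1 What the Batch API Is

For non-real-time work: submit a batch of requests, wait up to 24 hours (often less), and get the results back together. The price is 50% off.

OpenAI, Anthropic, and Gemini all offer one.

// OpenAI Batch
const batch = await openai.batches.create({
  input_file_id: 'file_xxx',  // a previously uploaded file of requests
  endpoint: '/v1/chat/completions',
  completion_window: '24h'
});

// Completes within 24 hours; results are written to the output file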
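The input file referenced by `input_file_id` is a JSONL file with one request per line. The sketch below builds such a file body in the shape the OpenAI Batch API documents (`custom_id`, `method`, `url`, `body`); the `BatchItem` type, the model name, and the token limit are assumptions of this example.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface BatchItem {
  id: string; // your own identifier for the request
  messages: ChatMessage[];
}

// Serialize requests as JSONL: one JSON object per line.
function toBatchJsonl(items: BatchItem[], model: string, maxTokens: number): string {
  return items
    .map((item) =>
      JSON.stringify({
        custom_id: item.id,          // echoed in the output file, used to match results to inputs
        method: 'POST',
        url: '/v1/chat/completions', // must match the `endpoint` given when creating the batch
        body: { model, messages: item.messages, max_tokens: maxTokens },
      }),
    )
    .join('\n');
}

const jsonl = toBatchJsonl(
  [
    { id: 'doc-1', messages: [{ role: 'user', content: 'Summarize document 1' }] },
    { id: 'doc-2', messages: [{ role: 'user', content: 'Summarize document 2' }] },
  ],
  'gpt-5-mini',
  200,
);
console.log(jsonl);
```

The resulting text is uploaded as a file with purpose `batch` before calling `batches.create`. Results are not guaranteed to come back in input order, which is why each line carries a `custom_id`.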

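6.2 When to Use It

Offline evaluation (scoring 1,000 test cases)
Bulk generation (writing 100 product descriptions in one go)
Data enrichment (labeling 100,000 records)
Training data generation
Bulk embedding computation

Not suitable for:

Real-time user requests
Latency-sensitive scenarios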

6.3 Mixing Batch and Real-Time

Many workloads can be split:

async function summarizeDoc(docId: string, userId: string) {
  // 1. Check the cache first
  const cached = await cache.get(`summary:${docId}`);
  if (cached) return cached;
  
  // 2. Real-time call (the user is waiting)
  const summary = await callRealtime(docId);
  
  // 3. Async: queue the doc for batch "deep analysis" (not urgent)
  await batchQueue.add({
    docId,
    task: 'deep_analysis',
    callback: 'webhook'
  });
  
  return summary;
}

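6.4 A Realistic Cost Comparison

Task: 100,000 summaries per day

Real-time on Opus:
- 100,000 × $0.05 = $5,000/day

Real-time for simple requests (Haiku) + batch deep analysis (Sonnet):
- 80% simple × 100,000 × $0.003 = $240
- 20% detailed via batch × 100,000 × $0.015 × 0.5 = $150
- Total: $390/day

Cost reduction: 92%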

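Chapter 7: Embedding and RAG Costs

7.1 The Cost Structure of Embeddings

Embedding calls (USD per 1M tokens):
- OpenAI text-embedding-3-large: $0.13
- OpenAI text-embedding-3-small: $0.02
- Voyage AI: about $0.10 (varies by model)
- Self-hosted BGE: amortized GPU cost

Vector storage:
- Pinecone: from $70/month
- pgvector: free extension (runs inside Postgres)
- Qdrant: free when self-hosted

Retrieval:
- Each query costs one embedding call plus one vector search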

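7.2 Embedding Caching

If a document has not changed, do not embed it again:

import { createHash } from 'node:crypto';

const sha256 = (s: string) => createHash('sha256').update(s).digest('hex');

async function getEmbedding(text: string): Promise<number[]> {
  // Content hash as cache key (add the model name to the key if you may switch embedding models)
  const hash = sha256(text);
  
  // Check the cache
  const cached = await embeddingCache.get(hash);
  if (cached) return cached;
  
  // Call the API
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text
  });
  // The vector lives in data[0].embedding, not on the response itself
  const embedding = response.data[0].embedding;
  
  // Cache it (permanently or with a long TTL; here 30 days in seconds)
  await embeddingCache.set(hash, embedding, { ttl: 30 * 24 * 3600 });
  
  return embedding;
}

Result: once the document library has been embedded the first time, ongoing cost is close to zero.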

7.3 Optimizing RAG End to End

async function rag(query: string) {
  // 1. Query embedding, served from cache when possible (getEmbedding from 7.2)
  const queryEmbedding = await getEmbedding(query);
  
  // 2. Vector search (limit to top K)
  const docs = await vectorDB.search(queryEmbedding, { topK: 5 });
  
  // 3. Rerank (use a cheap reranker, not an LLM)
  const reranked = await rerank(query, docs);
  
  // 4. Pass only the top 3 to the LLM
  const context = formatDocs(reranked.slice(0, 3));
  
  // 5. Use Sonnet rather than Opus (Sonnet is enough for most RAG tasks)
  return await sonnet.chat({ messages: [..., { context, query }] });
}

Every step is a place to cut cost.

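Chapter 8: Token Compression

8.1 Input Compression

A long prompt is an expensive prompt. Ways to compress:

A. Remove redundancy

// Bad: verbose explanations + elaborate examples
const verbosePrompt = `
You are a helpful assistant. 
Please be polite and respectful.
Always provide accurate information.
Format your responses clearly.
... (3,000 more characters of description)
`;

// Good: concise
const concisePrompt = `Helpful assistant. Be polite, accurate, clear.`;

B. LLM compression

Use a cheap LLM to compress a long prompt:

const compressed = await haiku.chat({
  messages: [{
    role: 'user',
    content: `Compress this prompt to 30% length, keep meaning:\n\n${longPrompt}`
  }]
});

Only ship the compressed prompt after testing that results are unchanged.

C. Dedicated libraries such as LLMLingua

LLMLingua (Microsoft) uses a small model to compress prompts; the paper reports up to 20x compression with little loss in quality.

from llmlingua import PromptCompressor

compressor = PromptCompressor()
# Returns a dict; the compressed text is under the "compressed_prompt" key
compressed = compressor.compress_prompt(
    long_prompt,
    rate=0.3  # compress to 30% of the original
)

In practice:

Simple tasks: removing 70% of the tokens is nearly lossless
Complex tasks: quality starts to degrade at around 50%
Reasoning tasks: do not compress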

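8.2 Limiting Output

A. Set max_tokens strictly

const response = await llm.chat({
  messages: [...],
  max_tokens: 200,  // do not leave it at a 4096 default
});

B. Structured output

// Long text
"The user has been active over the last 30 days, orders 2-3 times a week, and likes electronics..."

// Structured (short)
{ "active": true, "weekly_orders": 2.5, "category": "electronics" }

About 70% fewer tokens.

C. Ask for short answers

const prompt = `${question}\n\nAnswer in one sentence.`;
// vs. the model's default of 3-4 paragraphs

8.3 History Compression

See Chapter 4 of 01. Use a rolling summary for long conversations.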

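Chapter 9: Model Combination Strategies

9.1 Tiered Processing

Tier 1 (Haiku class): intent classification, first-pass filtering, simple answers
  ↓ if more depth is needed
Tier 2 (Sonnet class): regular conversation, tool calls, RAG
  ↓ if more depth is needed
Tier 3 (Opus class): complex reasoning, critical decisions
  ↓ if a specialist is needed
Tier 4 (o3 and similar): math, code, long-chain reasoning

async function smartChat(query: string) {
  // 1. Intent classification (Haiku, nearly free)
  const intent = await haiku.classify(query, ['chitchat', 'task', 'reasoning']);
  
  if (intent === 'chitchat') {
    return await haiku.chat({ messages: [...] });  // chitchat goes to Haiku
  }
  
  if (intent === 'reasoning') {
    return await o3.chat({ messages: [...] });  // reasoning goes to o3
  }
  
  // 'task' and any unexpected label go to Sonnet, so the function always returns
  return await sonnet.chat({ messages: [...] });
}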

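9.2 The Generator-Verifier Pattern

A cheap model generates and an expensive model verifies:

async function generateWithVerify(query: string) {
  // Haiku drafts quickly
  const draft = await haiku.chat({ messages: [{ role: 'user', content: query }] });
  
  // Opus checks the draft; a fixed first-line verdict keeps parsing unambiguous
  const verdict = await opus.chat({
    messages: [{
      role: 'user',
      content: `Question: ${query}\nDraft answer: ${draft.content}\n\nIs this answer correct and complete? Reply PASS or FAIL on the first line, then list what is missing.`
    }]
  });
  
  // Substring matching on "correct" would also match "incorrect"
  if (verdict.content.trim().toUpperCase().startsWith('PASS')) {
    return draft.content;  // keep the cheap answer
  }
  
  // Only when the draft fails does Opus rewrite it
  const rewrite = await opus.chat({ messages: [{ role: 'user', content: query }] });
  return rewrite.content;
}

Effect:

If Haiku gets 70% of tasks right, those are served cheaply
The other 30% are redone by Opus, which protects quality

Average cost: the Opus rewrites alone come to ~30% of all-Opus; the Haiku draft and the Opus verification call (mostly input tokens) are paid on every request on top of that.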
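The expected cost of this pattern can be written as draft + verify + (1 - passRate) × rewrite, since every request pays for the draft and the verification and only failed drafts pay for a rewrite. The sketch below evaluates it with hypothetical per-call prices (the $0.05 Opus and $0.003 Haiku figures echo section 6.4; the $0.01 verification cost is an assumption).

```typescript
// Expected cost per request for generate-then-verify.
function expectedCostPerRequest(
  draftCost: number,   // cheap model generating the draft
  verifyCost: number,  // strong model judging the draft (mostly input tokens)
  rewriteCost: number, // strong model answering from scratch
  passRate: number,    // fraction of drafts the verifier accepts
): number {
  return draftCost + verifyCost + (1 - passRate) * rewriteCost;
}

// The pattern only saves money when draft + verify < passRate * rewrite.
function minPassRateToSave(draftCost: number, verifyCost: number, rewriteCost: number): number {
  return (draftCost + verifyCost) / rewriteCost;
}

const allOpus = 0.05;
const combined = expectedCostPerRequest(0.003, 0.01, allOpus, 0.7);
const shareOfAllOpus = combined / allOpus;
const minPass = minPassRateToSave(0.003, 0.01, allOpus);
console.log({ combined, shareOfAllOpus, minPass });
```

With these particular numbers the pattern costs a little over half of all-Opus rather than 30%, because the verification call is not free; keeping the verifier prompt and its output short is what moves the figure toward the rewrite-only share.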

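9.3 Speculative Decoding (at Inference Time)

A small model drafts quickly and a large model verifies:

The small model drafts N tokens, one after another
  ↓
The large model verifies all N tokens in a single forward pass
  ↓
The matching prefix is accepted; at the first mismatch the large model's own token is used instead

Good fit: self-hosted deployments. OpenAI / Anthropic API users cannot control this, but vLLM and SGLang support it.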
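The accept/reject loop can be illustrated with a toy simulation. This is a greedy-decoding sketch under strong simplifications: the "models" are deterministic functions from a prefix to the next token, and verification is a plain loop where real systems use one batched forward pass (and rejection sampling when sampling is enabled). The toy models are invented for the example.

```typescript
type NextToken = (prefix: string[]) => string;

function speculativeDecode(
  target: NextToken,
  draft: NextToken,
  prompt: string[],
  maxNewTokens: number,
  lookahead: number,
): { tokens: string[]; targetPasses: number } {
  const out = [...prompt];
  const limit = prompt.length + maxNewTokens;
  let targetPasses = 0;
  while (out.length < limit) {
    // 1. The draft model proposes up to `lookahead` tokens sequentially.
    const proposed: string[] = [];
    const budget = Math.min(lookahead, limit - out.length);
    for (let i = 0; i < budget; i++) proposed.push(draft([...out, ...proposed]));

    // 2. The target model checks every proposed position (one pass on real hardware).
    targetPasses++;
    let accepted = 0;
    let correction: string | null = null;
    for (let i = 0; i < proposed.length; i++) {
      const expected = target([...out, ...proposed.slice(0, i)]);
      if (expected !== proposed[i]) {
        correction = expected;
        break;
      }
      accepted++;
    }

    // 3. Keep the agreed prefix, then the target's own token at the first mismatch.
    out.push(...proposed.slice(0, accepted));
    if (correction !== null && out.length < limit) out.push(correction);
  }
  return { tokens: out.slice(prompt.length), targetPasses };
}

// Toy models: the target repeats a fixed sentence; the draft gets one word in six wrong.
const sentence = ['the', 'cat', 'sat', 'on', 'the', 'mat'];
const targetModel: NextToken = (prefix) => sentence[prefix.length % sentence.length];
const draftModel: NextToken = (prefix) =>
  prefix.length % sentence.length === 3 ? 'in' : sentence[prefix.length % sentence.length];

const decoded = speculativeDecode(targetModel, draftModel, [], 12, 4);
console.log(decoded.tokens.join(' '), `(target passes: ${decoded.targetPasses})`);
```

The output is identical to what the target model would produce on its own, but it needs fewer target passes than one per token; the speedup depends entirely on how often the draft agrees with the target.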

Chapter 10: Cost Attribution and Monitoring

See Chapter 5-9. This chapter adds the routing perspective.

10.1 Metrics You Must Monitor

interface CostMetrics {
  // By time
  daily_cost: number;
  hourly_cost: number;
  
  // By dimension
  cost_by_user: Map<string, number>;
  cost_by_feature: Map<string, number>;
  cost_by_model: Map<string, number>;
  
  // Efficiency
  cache_hit_rate: number;       // target > 60%
  fallback_rate: number;         // abnormal above 5%
  avg_cost_per_request: number;
  
  // Routing
  routing_distribution: Map<string, number>;  // share of traffic per model
}
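Most of these metrics can be derived from a per-request log. The sketch below aggregates a hypothetical `RequestLog` record into the dimensions listed above; the log shape and field names are assumptions, and a real pipeline would typically compute this in a metrics store rather than in application memory.

```typescript
interface RequestLog {
  userId: string;
  model: string;
  costUsd: number;
  cacheHit: boolean;
  usedFallback: boolean;
}

interface CostSummary {
  totalCost: number;
  costByUser: Map<string, number>;
  costByModel: Map<string, number>;
  cacheHitRate: number;
  fallbackRate: number;
  avgCostPerRequest: number;
  routingDistribution: Map<string, number>; // share of requests per model
}

function addTo(map: Map<string, number>, key: string, amount: number): void {
  map.set(key, (map.get(key) ?? 0) + amount);
}

function summarize(logs: RequestLog[]): CostSummary {
  const costByUser = new Map<string, number>();
  const costByModel = new Map<string, number>();
  const requestsByModel = new Map<string, number>();
  let totalCost = 0;
  let hits = 0;
  let fallbacks = 0;
  for (const log of logs) {
    totalCost += log.costUsd;
    addTo(costByUser, log.userId, log.costUsd);
    addTo(costByModel, log.model, log.costUsd);
    addTo(requestsByModel, log.model, 1);
    if (log.cacheHit) hits++;
    if (log.usedFallback) fallbacks++;
  }
  const n = logs.length;
  const routingDistribution = new Map<string, number>();
  requestsByModel.forEach((count, model) => {
    routingDistribution.set(model, count / n);
  });
  return {
    totalCost,
    costByUser,
    costByModel,
    cacheHitRate: n === 0 ? 0 : hits / n,
    fallbackRate: n === 0 ? 0 : fallbacks / n,
    avgCostPerRequest: n === 0 ? 0 : totalCost / n,
    routingDistribution,
  };
}

const summary = summarize([
  { userId: 'u1', model: 'haiku', costUsd: 0.125, cacheHit: true, usedFallback: false },
  { userId: 'u1', model: 'haiku', costUsd: 0.125, cacheHit: false, usedFallback: false },
  { userId: 'u2', model: 'opus', costUsd: 0.5, cacheHit: true, usedFallback: true },
  { userId: 'u2', model: 'sonnet', costUsd: 0.25, cacheHit: false, usedFallback: false },
]);
console.log(summary);
```

Comparing `routingDistribution` against `costByModel` is a quick way to spot a model that serves a small share of requests but a large share of the bill.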

10.2 Anomaly Alerts

// `m` is a CostMetrics snapshot; avg_hourly_cost is assumed to be a rolling baseline computed elsewhere
const alerts = [
  {
    name: 'cost_spike',
    condition: (m) => m.hourly_cost > m.avg_hourly_cost * 3,
    action: 'page_oncall'
  },
  {
    name: 'cache_miss_high',
    condition: (m) => m.cache_hit_rate < 0.3,
    action: 'investigate_routing'
  },
  {
    name: 'fallback_spike',
    condition: (m) => m.fallback_rate > 0.1,
    action: 'check_model_health'
  },
  {
    name: 'user_burning',
    condition: (m) => Math.max(...m.cost_by_user.values()) > 10,  // a single user above $10 in one day
    action: 'rate_limit_user'
  }
];

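10.3 Optimization Cadence

Weekly:
- Review the top 10 most expensive users / features
- Check whether the routing distribution is reasonable
- Check the cache hit rate

Monthly:
- A/B test new routing strategies
- Evaluate adding new models to the fallback chain
- Evaluate which workloads can move to the Batch API

Quarterly:
- Recalibrate model selection (new models have shipped)
- Re-evaluate self-hosting vs. API
- Review the overall architecture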

Chapter 11: Worked Examples

11.1 Example: Routing for a Customer-Service SaaS

async function customerServiceRoute(query: string, user: User) {
  // 1. Urgent complaints go to Opus (quality matters)
  if (await isComplaint(query)) {
    return await opus.chat({ messages: [...] });
  }
  
  // 2. FAQs are served from cache
  //    (the cache is shared across tiers; key it by tier if answers must not mix)
  const cached = await semanticCache.get(query);
  if (cached) return cached;
  
  // 3. Ordinary questions are routed by user tier
  const model = user.tier === 'enterprise' ? 'opus'
              : user.tier === 'pro' ? 'sonnet'
              : 'haiku';
  
  const response = await llm[model].chat({ messages: [...] });
  
  // 4. Write back to the cache (24-hour TTL, in seconds)
  await semanticCache.set(query, response, { ttl: 24 * 3600 });
  
  return response;
}
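The `semanticCache` used above matches queries by meaning rather than by exact text. A minimal in-memory sketch is shown below: it compares embedding vectors with cosine similarity and returns the best match above a threshold. The 0.95 threshold, the linear scan, and passing embeddings in directly are all simplifications chosen for this example; a production version would compute embeddings itself and use a vector index.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

interface CacheEntry<T> {
  embedding: number[];
  value: T;
  expiresAt: number; // epoch milliseconds
}

class SemanticCache<T> {
  private entries: CacheEntry<T>[] = [];
  private threshold: number;

  constructor(threshold = 0.95) {
    this.threshold = threshold;
  }

  // Returns the most similar unexpired entry at or above the threshold, if any.
  get(queryEmbedding: number[], now: number = Date.now()): T | undefined {
    let best: { score: number; value: T } | undefined;
    for (const entry of this.entries) {
      if (entry.expiresAt <= now) continue;
      const score = cosineSimilarity(queryEmbedding, entry.embedding);
      if (score >= this.threshold && (best === undefined || score > best.score)) {
        best = { score, value: entry.value };
      }
    }
    return best?.value;
  }

  set(embedding: number[], value: T, ttlMs: number, now: number = Date.now()): void {
    this.entries.push({ embedding, value, expiresAt: now + ttlMs });
  }
}

const faqCache = new SemanticCache<string>(0.95);
faqCache.set([1, 0], 'We are open 9:00-18:00.', 60_000, 0);
const nearHit = faqCache.get([0.99, 0.05], 1_000);  // almost the same direction
const miss = faqCache.get([0, 1], 1_000);           // unrelated query
const expired = faqCache.get([1, 0], 120_000);      // past the TTL
console.log({ nearHit, miss, expired });
```

The threshold is the main tuning knob: too low and users get answers to a different question, too high and the cache rarely hits.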

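11.2 Example: Routing for a Coding Agent

async function codingTaskRoute(task: CodingTask) {
  // 1. Planning uses Opus (one call, high stakes)
  //    Assumes the client parses the plan into a `steps` array
  const plan = await opus.chat({
    messages: [{ content: `Plan this task: ${task.description}` }]
  });
  
  // 2. Each step is implemented by Sonnet (many calls, best value)
  //    Promise.all runs the steps in parallel, which assumes they are independent
  const results = await Promise.all(
    plan.steps.map(step => 
      sonnet.chat({ messages: [{ content: `Implement: ${step}` }] })
    )
  );
  
  // 3. The final review uses Opus (the critical quality gate)
  return await opus.chat({
    messages: [{ content: `Review: ${JSON.stringify(results)}` }]
  });
}

See the Architect → Implementor pattern in Chapters 05/07, which lends itself naturally to multi-model routing.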

11.3 Example: An Embedding Service

class EmbeddingService {
  async embedBatch(texts: string[]) {
    // 1. Check the cache
    const results = new Array(texts.length);
    const toEmbed: { index: number; text: string }[] = [];
    
    for (let i = 0; i < texts.length; i++) {
      const cached = await this.cache.get(this.hash(texts[i]));
      if (cached) {
        results[i] = cached;
      } else {
        toEmbed.push({ index: i, text: texts[i] });
      }
    }
    
    if (toEmbed.length === 0) return results;
    
    // 2. One API call for all misses (OpenAI accepts an array as input)
    const newEmbeddings = await openai.embeddings.create({
      model: 'text-embedding-3-small',  // the small model is good enough
      input: toEmbed.map(t => t.text)
    });
    
    // 3. Write back to the cache and fill in the results
    for (let i = 0; i < toEmbed.length; i++) {
      const embedding = newEmbeddings.data[i].embedding;
      await this.cache.set(this.hash(toEmbed[i].text), embedding);
      results[toEmbed[i].index] = embedding;
    }
    
    return results;
  }
}

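Chapter 12: Pitfalls

12.1 Routing Layer

Pitfall	Consequence	Fix
Opus for everything	Burns money	Add routing
Haiku for everything	Poor quality	Add routing
The routing LLM costs more than the business LLM	No savings	Use a Haiku-class model for routing
Routing mistakes are invisible	You do not know the wrong model is in use	Logging + dashboard
No fallback	A vendor outage breaks the user experience	Cross-vendor fallback

12.2 Cache Layer

Pitfall	Consequence	Fix
System prompt not cached	Waste	Always cache it
Timestamp in the system prompt	Cache never hits	Move it to the user message
Poorly designed cache key	Low hit rate	Hash the entire prompt in a normalized form
Hit rate not monitored	You do not know whether caching helps	Add metrics

12.3 Batch Layer

Pitfall	Consequence	Fix
Real-time tasks forced into batch	Users wait 24 hours	Separate the two clearly
Batch queue backs up	Tasks never run	Monitor the backlog
No failure handling	Some tasks are lost	Retry mechanism

12.4 Culture Layer

Pitfall	Symptom	Fix
Engineers think "it's not my money"	The most expensive model is the default	Let engineers see the cost of their own features
Cost is not attributed to teams	Nobody knows who is burning money	Tag by team / feature
Cost saving is seen as low-status work	Nobody optimizes	Make cost a performance dimension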

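Chapter 13: Future Directions

13.1 Dynamic Pricing

Model API prices will no longer be static. Anthropic is reportedly testing load-based dynamic pricing. Routing strategies will have to adapt.

13.2 Finer-Grained Model-as-a-Service

Not just Haiku / Sonnet / Opus. We may see:

Pay for latency (faster costs more)
Pay for quality (better answers cost more)
Pay for context (longer context costs more)

13.3 Specialized Routing Models

Small models trained specifically for routing, more accurate and cheaper than a general-purpose Haiku.

13.4 Self-Optimizing Routing

LLM-as-Optimizer: let AI analyze which model is best for which queries and update the routing policy automatically. See Chapter 18, AI for AI.

13.5 Native Cost Visibility

Future IDEs / APIs may show cost in real time:

// Shown by the IDE while you write code
const response = await openai.chat.completions.create({...});
// ↑ Estimated cost: $0.012 per call, $360/month at 1000 calls/day

This lets engineers "feel" the cost.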

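Authoritative Resources

Helicone
Portkey
LiteLLM
OpenRouter
Vercel AI Gateway
Anthropic Prompt Caching
OpenAI Batch API
LLMLingua (Microsoft)
01 Context Engineering (caching prerequisite)
03 Prompt Metaprogramming (multi-model adaptation prerequisite)
5-9 LLMOps (cost attribution prerequisite)
6-4 Cost Modeling (business prerequisite)

Last verified: 2026-06-12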
